
    Cohesion and translation variation: Corpus-based analysis of translation varieties

    In this study, we analyse cohesion in human and machine translations, which we call 'translation varieties' as defined by Lapshinova-Koltunski (2017): translation types that differ in the translation methods involved. We expect variation in the distribution of the different cohesive devices that occur in translations. Variation in translation can be caused by different factors, e.g. by systemic contrasts or ambiguities in both source and target languages. It is known that variation in English-to-German translations depends on the devices of cohesion involved. We extract quantitative evidence for cohesive devices from a corpus and analyse it with descriptive techniques to see where the differences lie. We include in our analyses not only English-German translations but also English and German non-translated texts, representing the source and the target language. Similarities and differences between translated and non-translated texts could provide us with information on the origin of this variation, which might be caused by translationese features.

    Linguistic features of genre and method variation in translation: A computational perspective

    From The Grammar of Genres and Styles - From Discrete to Non-Discrete Units. Edited by Legallois, D., Charnois, T. and Larjavaara, M. In this contribution, we describe the use of text classification methods to investigate genre and method variation in an English-German translation corpus. For this purpose, we represent texts with linguistically motivated features: part-of-speech tags arranged in bigrams, trigrams, and 4-grams. The classification method used in this study is a Bayesian classifier with Laplace smoothing. We use the output of the classifiers to carry out an extensive feature analysis of the main differences between genres and methods of translation.
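    The classification setup described in this abstract can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model (the abstract only says "a Bayesian classifier with Laplace smoothing"), and the class labels, tag set, and toy training data are hypothetical and not taken from the study.

```python
import math
from collections import Counter, defaultdict


def pos_ngrams(tags, n_values=(2, 3, 4)):
    """Return POS bigrams, trigrams, and 4-grams as tuples."""
    return [tuple(tags[i:i + n])
            for n in n_values
            for i in range(len(tags) - n + 1)]


class NaiveBayes:
    """Multinomial naive Bayes over POS n-grams (an assumed model choice)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # alpha = 1.0 is Laplace (add-one) smoothing

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for tags, label in zip(docs, labels):
            feats = pos_ngrams(tags)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, tags):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for feat in pos_ngrams(tags):
                if feat in self.vocab:  # skip n-grams unseen in training
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: POS-tag sequences with made-up method labels.
train_docs = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "ADV"],
    ["NOUN", "NOUN", "VERB", "NOUN", "NOUN"],
    ["NOUN", "VERB", "NOUN", "NOUN", "VERB"],
]
train_labels = ["human", "human", "machine", "machine"]

model = NaiveBayes().fit(train_docs, train_labels)
pred_a = model.predict(["DET", "NOUN", "VERB", "ADV"])
pred_b = model.predict(["NOUN", "NOUN", "VERB", "NOUN"])
```

    Smoothing matters here because most POS n-grams never occur in a given class; without the added count, a single unseen n-gram would force the class probability to zero.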

    ParCorFull2.0: a Parallel Corpus Annotated with Full Coreference

    In this paper, we describe ParCorFull2.0, a parallel corpus annotated with full coreference chains for multiple languages, which is an extension of the existing corpus ParCorFull (Lapshinova-Koltunski et al., 2018). Similar to the previous version, this corpus has been created to address translation of coreference across languages, a phenomenon still challenging for machine translation (MT) and other multilingual natural language processing (NLP) applications. The current version of the corpus that we present here contains parallel texts not only for the language pair English-German, but also for English-French and English-Portuguese, all of which are major European languages. The two new target languages, French and Portuguese, are Romance languages. The addition of a new language group creates a need for extension not only in terms of the texts added, but also in terms of the annotation guidelines. Both French and Portuguese contain structures not found in English and German. Moreover, Portuguese is a pro-drop language, which brings even more systemic differences in the realisation of coreference into our cross-lingual resources. These differences cause problems for multilingual coreference resolution and machine translation. Our parallel corpus with full annotation of coreference will be a valuable resource with a variety of uses, not only for NLP applications, but also for contrastive linguists and researchers in translation studies. Christian Hardmeier and Elina Lartaud were supported by the Swedish Research Council under grant 2017-930, which also funded the annotation work of the French data. Pedro Augusto Ferreira was supported by FCT, Foundation for Science and Technology, Portugal, under grant SFRH/BD/146578/2019.

    ParCorFull: a Parallel Corpus Annotated with Full Coreference

    ParCorFull is a parallel corpus annotated with full coreference chains that has been created to address an important problem that machine translation and other multilingual natural language processing (NLP) technologies face: translation of coreference across languages. Our corpus contains parallel texts for the language pair English-German, two major European languages. Despite being typologically very close, these languages still have systemic differences in the realisation of coreference, and thus pose problems for multilingual coreference resolution and machine translation. Our parallel corpus covers the genres of planned speech (public lectures) and newswire. It is richly annotated for coreference in both languages, including annotation of both nominal coreference and reference to antecedents expressed as clauses, sentences and verb phrases. This resource supports research in natural language processing, contrastive linguistics and translation studies on the mechanisms involved in coreference translation, with the aim of developing a better understanding of the phenomenon.

    Cross-lingual Incongruences in the Annotation of Coreference

    In the present paper, we deal with incongruences in English-German multilingual coreference annotation and present automated methods to discover them. More specifically, we automatically detect full coreference chains in parallel texts and analyse discrepancies in their annotations. In doing so, we wish to find out whether the discrepancies derive from language-typological constraints, from the translation, or from the actual annotation process. The results of our study contribute to the referential analysis of similarities and differences across languages and support the evaluation of cross-lingual coreference annotation. They are also useful for cross-lingual coreference resolution systems and contrastive linguistic studies.

    DiHuTra: a parallel corpus to analyse differences between human translations

    This paper describes a new corpus of human translations which contains both professional and student translations. The data consists of English sources (texts from news and reviews) and their translations into Russian and Croatian, as well as a subcorpus containing translations of the review texts into Finnish. All target languages are mid-resourced and less- or mid-investigated. The corpus will be valuable for studying variation in translation, as it allows a direct comparison between human translations of the same source texts. The corpus will also be a valuable resource for evaluating machine translation systems. We believe that this resource will facilitate understanding and improvement of the quality issues in both human and machine translation. In the paper, we describe how the data was collected, provide information on the translator groups, and summarise the differences between the human translations at hand based on our preliminary results with shallow features.

    Computational analysis of different translations: by professionals, students and machines

    In this work, we analyse translated texts in terms of various features. We compare two types of human translation, professional and student, with machine translation (MT) outputs in terms of lexical and grammatical variety, sentence length, and the frequencies of different part-of-speech (POS) tags and POS trigrams. Our analyses are carried out on parallel translations into Croatian, Finnish and Russian, all originating from the same English source texts. Our results indicate that machine translations are the closest to the source text, followed by student translations. Also, student translations are sometimes more similar to MT than to professional translations. Furthermore, we identify sets of features distinctive of machine translations.
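    The shallow features named in this abstract can be sketched as follows. This is only an illustration under stated assumptions: type-token ratio stands in for "lexical variety" (the abstract does not say which measure was used), the input is assumed to be already tokenised and POS-tagged, and the function name and sample sentences are hypothetical.

```python
from collections import Counter


def shallow_features(sentences):
    """Compute shallow features for a text.

    sentences: list of sentences, each a list of (token, pos_tag) pairs.
    """
    tokens = [tok.lower() for sent in sentences for tok, _ in sent]
    if not tokens:
        raise ValueError("need at least one token")

    pos_counts = Counter(tag for sent in sentences for _, tag in sent)

    # POS trigrams are counted within sentences, not across boundaries.
    pos_trigrams = Counter()
    for sent in sentences:
        tags = [tag for _, tag in sent]
        pos_trigrams.update(zip(tags, tags[1:], tags[2:]))

    return {
        # Type-token ratio: an assumed proxy for lexical variety.
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
        "pos_counts": pos_counts,
        "pos_trigrams": pos_trigrams,
    }


# Hypothetical tagged sample, not drawn from the corpus.
sample = [
    [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB"), ("loudly", "ADV")],
]
features = shallow_features(sample)
```

    Computing the same feature dictionary for a professional, a student, and a machine translation of one source text gives directly comparable profiles, since all three share the same source.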

    Using ChatGPT as a CAT tool in Easy Language translation

    This study sets out to investigate the feasibility of using ChatGPT to translate citizen-oriented administrative texts into German Easy Language, a simplified, controlled language variety that is adapted to the needs of people with reading impairments. We use ChatGPT to translate selected texts from websites of German public authorities using two strategies: linguistic and holistic. We analyse the quality of the generated texts based on different criteria, such as correctness, readability, and syntactic complexity. The results indicate that the generated texts are easier than the standard texts, but that they still do not fully meet the established Easy Language standards. Additionally, the content is not always rendered correctly.